Joint probability distribution

In the study of probability, given two random variables X and Y that are defined on the same probability space, the joint distribution for X and Y defines the probability of events defined in terms of both X and Y. In the case of only two random variables this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution. The form the joint distribution takes depends on whether the random variables are dependent or independent.


Example

Consider the roll of a fair die and let A = 1 if the number is even (i.e., 2, 4, or 6) and A = 0 otherwise. Furthermore, let B = 1 if the number is prime (i.e., 2, 3, or 5) and B = 0 otherwise. Then the joint distribution of A and B is


  \mathrm{P}(A=0,B=0)=P\{1\}=\frac{1}{6},\; \mathrm{P}(A=1,B=0)=P\{4,6\}=\frac{2}{6}

  \mathrm{P}(A=0,B=1)=P\{3,5\}=\frac{2}{6},\; \mathrm{P}(A=1,B=1)=P\{2\}=\frac{1}{6}
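A brief computational sketch of this example (plain Python; the variable names are illustrative) enumerates the six equally likely faces and tabulates the pairs (A, B):

from fractions import Fraction
from collections import Counter

# Enumerate the six equally likely faces of a fair die and tabulate (A, B),
# where A indicates "even" and B indicates "prime".
joint = Counter()
for face in range(1, 7):
    a = 1 if face % 2 == 0 else 0
    b = 1 if face in (2, 3, 5) else 0
    joint[(a, b)] += Fraction(1, 6)

for (a, b), p in sorted(joint.items()):
    print(f"P(A={a}, B={b}) = {p}")
# Prints 1/6, 1/3, 1/3, 1/6 for (0,0), (0,1), (1,0), (1,1), matching the values above.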

Cumulative distribution

The cumulative distribution function for a pair of random variables X and Y is defined in terms of their joint probability distribution:

F(x,y)=P(X \le x, Y \le y) .
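For the die example above, this cumulative distribution function can be obtained by summing the joint probability mass function over all pairs at or below (x, y); a minimal sketch, assuming the pmf tabulated earlier:

from fractions import Fraction

# Joint pmf of the die example (A = even indicator, B = prime indicator).
pmf = {(0, 0): Fraction(1, 6), (0, 1): Fraction(1, 3),
       (1, 0): Fraction(1, 3), (1, 1): Fraction(1, 6)}

def joint_cdf(x, y):
    """F(x, y) = P(A <= x, B <= y), accumulated from the joint pmf."""
    return sum(p for (a, b), p in pmf.items() if a <= x and b <= y)

print(joint_cdf(0, 0))  # 1/6
print(joint_cdf(1, 0))  # 1/6 + 1/3 = 1/2
print(joint_cdf(1, 1))  # 1, the whole sample space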

Discrete case

The joint probability mass function of two discrete random variables X and Y is equal to


\begin{align}
\mathrm{P}(X=x\ \mathrm{and}\ Y=y) & {} = \mathrm{P}(Y=y \mid X=x) \cdot \mathrm{P}(X=x) \\
& {} = \mathrm{P}(X=x \mid Y=y) \cdot \mathrm{P}(Y=y).
\end{align}

In general, the joint probability distribution of n discrete random variables X_1,...,X_n is equal to


\begin{align}
\mathrm{P}(X_1=x_1,\dots,X_n=x_n) = \; & \mathrm{P}(X_1=x_1)\cdot \\
& \mathrm{P}(X_2=x_2\mid X_1=x_1)\cdot \\
& \mathrm{P}(X_3=x_3\mid X_1=x_1,X_2=x_2) \cdot \\
& \dots \\
& \mathrm{P}(X_n=x_n\mid X_1=x_1,\dots,X_{n-1}=x_{n-1}).
\end{align}

This identity is known as the chain rule of probability.

Since these are probabilities, we have

\sum_x \sum_y \mathrm{P}(X=x\ \mathrm{and}\ Y=y) = 1.\;
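A short sketch checking both the factorization and the normalization for the die example (plain Python with fractions; the dictionary layout is an assumption of this illustration):

from fractions import Fraction

# Joint pmf of the die example, indexed as (a, b).
pmf = {(0, 0): Fraction(1, 6), (0, 1): Fraction(1, 3),
       (1, 0): Fraction(1, 3), (1, 1): Fraction(1, 6)}

# Marginal of A and conditional of B given A, computed from the joint pmf.
p_a = {a: sum(p for (aa, b), p in pmf.items() if aa == a) for a in (0, 1)}
p_b_given_a = {(b, a): pmf[(a, b)] / p_a[a] for (a, b) in pmf}

# P(A = a, B = b) = P(B = b | A = a) * P(A = a) for every pair ...
assert all(pmf[(a, b)] == p_b_given_a[(b, a)] * p_a[a] for (a, b) in pmf)
# ... and the joint probabilities sum to one.
assert sum(pmf.values()) == 1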

Continuous case

Similarly, for continuous random variables X and Y, the joint probability density function can be written as f_{X,Y}(x, y), and this is

f_{X,Y}(x,y) = f_{Y|X}(y|x)f_X(x) = f_{X|Y}(x|y)f_Y(y)\;

where f_{Y|X}(y|x) and f_{X|Y}(x|y) give the conditional densities of Y given X = x and of X given Y = y respectively, and f_X(x) and f_Y(y) give the marginal densities of X and Y respectively.

Again, since these are probability densities, one has

\int_x \int_y f_{X,Y}(x,y) \; dy \; dx= 1.
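The same identities can be checked numerically for a concrete continuous distribution. The sketch below uses a standard bivariate normal with correlation rho (SciPy is an assumption of this illustration); for that family the conditional of Y given X = x is normal with mean rho·x and variance 1 − rho²:

import numpy as np
from scipy import stats, integrate

rho = 0.5
joint = stats.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

x, y = 0.3, -1.2
# Compare f_{X,Y}(x, y) with f_{Y|X}(y|x) * f_X(x).
lhs = joint.pdf([x, y])
rhs = stats.norm(loc=rho * x, scale=np.sqrt(1 - rho**2)).pdf(y) * stats.norm().pdf(x)
print(np.isclose(lhs, rhs))  # True

# The joint density integrates (numerically, over a wide box) to one.
total, _ = integrate.dblquad(lambda t, s: joint.pdf([s, t]), -8, 8, -8, 8)
print(round(total, 6))  # approximately 1.0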

Mixed case

In some situations X is continuous but Y is discrete. For example, in a logistic regression, one may wish to predict the probability of a binary outcome Y conditional on the value of a continuously distributed X. In this case, (X, Y) has neither a probability density function nor a probability mass function in the sense of the terms given above. On the other hand, a "mixed joint density" can be defined in either of two ways:


\begin{align}
f_{X,Y}(x,y) &= f_{X|Y}(x|y)\mathrm{P}(Y=y)\\
             &= \mathrm{P}(Y=y \mid X=x) f_X(x)
\end{align}

Formally, f_{X,Y}(x, y) is the probability density function of (X, Y) with respect to the product measure on the respective supports of X and Y. Either of these two decompositions can then be used to recover the joint cumulative distribution function:


\begin{align}
F_{X,Y}(x,y)&=\sum\limits_{t\le y}\int_{s=-\infty}^x f_{X,Y}(s,t)\;ds
\end{align}

The definition generalizes to a mixture of arbitrary numbers of discrete and continuous random variables.
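A sketch of such a mixed joint density, assuming the logistic-regression-style model X ~ N(0, 1) and P(Y = 1 | X = x) = sigmoid(x) (SciPy and the logistic link are assumptions of this illustration):

import numpy as np
from scipy import stats, integrate

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Mixed joint density f_{X,Y}(x, y) = P(Y = y | X = x) * f_X(x),
# with X continuous (standard normal) and Y binary.
def mixed_density(x, y):
    p1 = sigmoid(x)
    return (p1 if y == 1 else 1.0 - p1) * stats.norm.pdf(x)

# Summing over the discrete variable and integrating over the continuous one
# recovers total probability one.
total = sum(integrate.quad(lambda x: mixed_density(x, y), -10, 10)[0] for y in (0, 1))
print(round(total, 6))  # approximately 1.0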

General multidimensional distributions

The cumulative distribution function for a vector of random variables is defined in terms of their joint probability distribution:

F(x_1,\dots,x_n)=P(X_1 \le x_1,\dots, X_n \le x_n) .

The joint distribution for two random variables can be extended to many random variables X_1, \dots, X_n by adding them sequentially with the identity

\begin{align} f_{X_1, \ldots X_n}(x_1, \ldots x_n) =& f_{X_n | X_1, \ldots X_{n-1}}( x_n | x_1, \ldots x_{n-1}) f_{X_1, \ldots X_{n-1}}( x_1, \ldots x_{n-1} )\\
=& f_{X_1} (x_1) \\
 & \cdot f_{X_2|X_1} (x_2|x_1)\\
 & \cdot \dots \\
 & \cdot f_{X_{n-1}| X_1 \ldots X_{n-2}}(x_{n-1}| x_1, \ldots x_{n-2} ) \\
 & \cdot f_{X_n | X_1, \ldots X_{n-1}}( x_n | x_1, \ldots x_{n-1}),\end{align}

where

\begin{align}
f_{X_i| X_1, \ldots X_{i-1}}(x_i | x_1, \ldots x_{i-1})=
  &\frac{f_{X_1, \dots X_i}(x_1,\dots x_i)}{\int f_{X_1, \dots X_i}(x_1,\dots x_{i-1},u_i) \mathrm{d} u_i}\\
= &\frac{\int \dots \int f_{X_1, \dots X_n}(x_1,\dots x_i,u_{i+1}, \dots u_n) \mathrm{d} u_{i+1}\dots \mathrm{d}u_n}{\int \dots \int \int f_{X_1, \dots X_n}(x_1,\dots x_{i-1},u_i, \dots u_n) \mathrm{d} u_i \,\mathrm{d} u_{i+1}\dots \mathrm{d}u_n}
\end{align}

and

f_{X_1,\dots X_i}(x_1,\dots x_i) = \int \dots \int f_{X_1,\dots X_n}(x_1,\dots x_i,x_{i+1},\dots x_n) \mathrm{d} x_{i+1} \dots \mathrm{d} x_n

(notice that these latter identities can be useful for generating a random vector (X_1, \dots X_n) with a given joint density function f(x_1,\dots x_n)); the density of the marginal distribution is

f_{X_i}(x_i) = \int \dots \int \int \dots \int f_{X_1,\dots X_n}(x_1,\dots x_{i-1},x_i,x_{i+1},\dots x_n) \mathrm{d} x_1\dots \mathrm{d}x_{i-1} \, \mathrm{d}x_{i+1} \dots \mathrm{d}x_n.

The joint cumulative distribution function is

F_{X_1,\dots X_n}\left( x_1, \dots x_n\right)= \int_{-\infty}^{x_1} \dots \int_{-\infty}^{x_n} f_{X_1,\dots X_n}\left(u_1,\dots u_n\right) \mathrm{d} u_1 \dots \mathrm{d}u_n,
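These marginalization and conditioning operations can be illustrated on a density tabulated on a regular grid; the sketch below discretizes a trivariate normal (NumPy/SciPy and the particular covariance matrix are assumptions of this illustration) and recovers the marginal of X_1 and a conditional of X_3:

import numpy as np
from scipy import stats

# Tabulate a trivariate normal density on a regular grid.
grid = np.linspace(-5, 5, 101)
dx = grid[1] - grid[0]
X1, X2, X3 = np.meshgrid(grid, grid, grid, indexing="ij")
cov = np.array([[1.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.0]])
f = stats.multivariate_normal(mean=np.zeros(3), cov=cov).pdf(np.stack([X1, X2, X3], axis=-1))

# Marginal density of X_1: integrate out x_2 and x_3 (Riemann sums on the grid).
f_x1 = f.sum(axis=(1, 2)) * dx * dx
print(np.allclose(f_x1, stats.norm.pdf(grid), atol=1e-3))  # True: X_1 is standard normal

# Conditional density of X_3 given X_1 = grid[i], X_2 = grid[j]:
# the joint slice divided by its integral over x_3.
i, j = 50, 60
f_x3_given = f[i, j, :] / (f[i, j, :].sum() * dx)
print(round(float(f_x3_given.sum() * dx), 6))  # 1.0, a valid conditional density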

and the conditional distribution function is accordingly

\begin{align}
F_{X_i| X_1, \ldots X_{i-1}}(x_i| x_1, \ldots x_{i-1})=
  &\frac{\int_{-\infty}^{x_i}f_{X_1, \dots X_i}(x_1,\dots x_{i-1},u_i)\mathrm{d}u_i}{\int_{-\infty}^\infty f_{X_1, \dots X_i}(x_1,\dots x_{i-1},u_i) \mathrm{d} u_i}\\
= &\frac{\int_{-\infty}^\infty \dots \int_{-\infty}^\infty \int_{-\infty}^{x_i} f_{X_1, \dots X_n}(x_1,\dots x_{i-1},u_i, \dots u_n) \mathrm{d} u_i\dots \mathrm{d}u_n}{\int_{-\infty}^\infty \dots \int_{-\infty}^\infty \int_{-\infty}^\infty f_{X_1, \dots X_n}(x_1,\dots x_{i-1},u_i,\dots u_n) \mathrm{d} u_i \dots \mathrm{d} u_n}.
\end{align}

The expectation of a function h of the random variables reads

\mathbb{E}\left[h(X_1,\dots X_n) \right]=\int_{-\infty}^\infty \dots \int_{-\infty}^\infty h(x_1,\dots x_n) f_{X_1,\dots X_n}(x_1,\dots x_n) \mathrm{d} x_1 \dots \mathrm{d} x_n;

Suppose that h is smooth enough and that h(u_1,\dots u_n)=h(x_1,\dots x_n) whenever u_1 \ge x_1, \dots, u_n\ge x_n; then, by iterated integration by parts,

\begin{align}\mathbb{E}\left[h(X_1,\dots X_n) \right]=& h(x_1,\dots x_n)+ \\
& (-1)^n \int_{-\infty}^{x_1} \dots \int_{-\infty}^{x_n} F_{X_1,\dots X_n}(u_1,\dots u_n) \frac{\partial^n}{\partial x_1 \dots \partial x_n} h(u_1,\dots u_n) \mathrm{d} u_1 \dots \mathrm{d} u_n.\end{align}
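As a numerical illustration of the expectation formula (not of the integration-by-parts identity), the sketch below evaluates E[h(X_1, X_2)] for a bivariate normal both by quadrature against the joint density and by Monte Carlo; the choice h(x_1, x_2) = x_1 x_2, whose expectation is the covariance rho, is an assumption of this illustration:

import numpy as np
from scipy import stats, integrate

rho = 0.4
joint = stats.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
h = lambda x1, x2: x1 * x2  # E[X1 * X2] equals the covariance, rho

# Expectation as the integral of h against the joint density.
val, _ = integrate.dblquad(lambda t, s: h(s, t) * joint.pdf([s, t]), -8, 8, -8, 8)
print(round(val, 3))  # approximately 0.4

# The same expectation estimated by Monte Carlo sampling from the joint distribution.
samples = joint.rvs(size=200_000, random_state=0)
print(h(samples[:, 0], samples[:, 1]).mean())  # close to 0.4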

Joint distribution for independent variables

If for discrete random variables \ P(X = x \ \mbox{and} \ Y = y ) = P( X = x) \cdot P( Y = y) for all x and y, or for absolutely continuous random variables \ f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y) for all x and y, then X and Y are said to be independent.
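In the die example above, A and B are not independent, since, e.g., P(A=0, B=0) = 1/6 while P(A=0)·P(B=0) = 1/4; a small sketch of this check (plain Python, with the pmf laid out as before):

from fractions import Fraction
from itertools import product

pmf = {(0, 0): Fraction(1, 6), (0, 1): Fraction(1, 3),
       (1, 0): Fraction(1, 3), (1, 1): Fraction(1, 6)}
p_a = {a: sum(p for (aa, b), p in pmf.items() if aa == a) for a in (0, 1)}
p_b = {b: sum(p for (a, bb), p in pmf.items() if bb == b) for b in (0, 1)}

# Independence would require P(A=a, B=b) = P(A=a) * P(B=b) for every pair.
print(all(pmf[(a, b)] == p_a[a] * p_b[b] for a, b in product((0, 1), repeat=2)))  # False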

Joint distribution for conditionally independent variables

If a subset A of the variables X_1,\dots,X_n is conditionally independent given another subset B of these variables, then the joint distribution \mathrm{P}(X_1,\dots,X_n) is equal to P(B)\cdot P(A\mid B); because the variables in A are conditionally independent given B, P(A\mid B) itself factors into a product of smaller conditionals. The joint distribution can therefore be efficiently represented by the lower-dimensional probability distributions P(B) and P(A\mid B). Such conditional independence relations can be represented with a Bayesian network.
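A small sketch of this factorization for three binary variables, where A = {X_1, X_2} is conditionally independent given B = {X_3} (the numerical tables and variable names are purely illustrative):

from itertools import product

# Illustrative tables P(X3), P(X1 | X3) and P(X2 | X3); conditional independence
# of X1 and X2 given X3 means P(A | B) = P(X1 | X3) * P(X2 | X3).
p_x3 = {0: 0.7, 1: 0.3}
p_x1_given_x3 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}
p_x2_given_x3 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.5, (1, 1): 0.5}

# Joint distribution assembled as P(B) * P(A | B); this needs 1 + 2 + 2 = 5 free
# parameters instead of the 7 of an unconstrained joint over three binary variables.
joint = {(x1, x2, x3): p_x3[x3] * p_x1_given_x3[(x1, x3)] * p_x2_given_x3[(x2, x3)]
         for x1, x2, x3 in product((0, 1), repeat=3)}
print(round(sum(joint.values()), 6))  # 1.0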
